Marie-4: A High-Recall, Self-Improving Web Crawler That Finds Images Using Captions
Author
Abstract
page text describes associated images, and images are not captioned consistently. Content-based image retrieval systems that analyze the images themselves [1] are progressing, but the systems require considerable image-preprocessing time. Furthermore, surveys of users doing image retrieval show that users are more interested in the identification of objects and actions depicted by images than in the color, shape, and other visual properties that most content-based retrieval systems provide [2]. Because object and action information is more easily obtained from captions, caption-based retrieval appears to be the only hope for broadly useful image retrieval [3].

Commercial tools such as AltaVista’s Image Search engine achieve respectable precision (the fraction of correct answers retrieved out of all answers retrieved) by indexing only “easy” pages, such as photograph libraries where images are one to a page and captions are easy to identify. Recall (the fraction of correct answers retrieved out of all correct answers) is at least as important as precision, but users often do not realize how small it is for their queries. In experiments with 10 representative phrases, using pages retrieved by a traditional keyword-based AltaVista search to calculate recall, we found AltaVista’s Image Search had a precision of 0.46 and a recall of 0.10. Higher recall requires dealing with a large variety of page layout formats and styles of captioning.

Recent work has made important progress on general image indexing from the Web by intelligent information filtering of Web text [4–6]. By looking for the right clues, large amounts of Web page text can be excluded as captions for any given image, and the captions in the remaining text can be inferred. Clues can include caption candidate wording, HTML constructs around the candidate, distance from the associated image, image-file name words, and associated image properties.
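The clue types listed above can be illustrated with a minimal sketch. It is not Marie-4's implementation: the class `CaptionCandidateParser`, the function `caption_candidates`, the tag set `CAPTION_TAGS`, and all numeric weights are assumptions chosen for illustration. It uses only Python's standard `html.parser` module and scores text fragments near each image by three of the clues: distance from the image, enclosing HTML construct, and overlap with image-file name words.

```python
import os
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical choices for illustration only; not taken from the article.
CAPTION_TAGS = {"i", "b", "em", "strong", "center", "caption", "figcaption", "td"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class CaptionCandidateParser(HTMLParser):
    """Record images and text fragments in document order."""

    def __init__(self):
        super().__init__()
        self.events = []   # ("img", attrs) or ("text", enclosing_tag, text)
        self._stack = []   # currently open non-void tags

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.events.append(("img", dict(attrs)))
        elif tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop until the matching open tag has been removed.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            enclosing = self._stack[-1] if self._stack else None
            self.events.append(("text", enclosing, text))


def filename_words(src):
    """Split an image file name into lowercase words of three or more letters."""
    stem = os.path.splitext(os.path.basename(urlparse(src).path))[0]
    return {w.lower() for w in re.split(r"[^A-Za-z]+", stem) if len(w) > 2}


def caption_candidates(html, max_distance=2):
    """Return {image src: [(score, text), ...]} sorted by descending score."""
    parser = CaptionCandidateParser()
    parser.feed(html)
    parser.close()
    events = parser.events
    results = {}
    for i, event in enumerate(events):
        if event[0] != "img":
            continue
        attrs = event[1]
        src = attrs.get("src") or ""
        name_words = filename_words(src)
        scored = []
        if attrs.get("alt"):
            scored.append((2.0, attrs["alt"]))        # alt text as a candidate
        for j, other in enumerate(events):
            if other[0] != "text" or abs(j - i) > max_distance:
                continue                               # exclude distant text
            _, tag, text = other
            score = 1.0 / abs(j - i)                   # clue: distance from image
            if tag in CAPTION_TAGS:
                score += 1.0                           # clue: HTML construct
            if name_words & set(re.findall(r"[a-z]+", text.lower())):
                score += 1.0                           # clue: file-name words
            scored.append((score, text))
        results[src] = sorted(scored, reverse=True)
    return results


if __name__ == "__main__":
    page = (
        "<p>Welcome to our site.</p><p>More filler.</p>"
        '<center><img src="images/eagle_nest.jpg" alt="eagle"><br>'
        "<i>Bald eagle at its nest</i></center><p>Copyright 2002</p>"
    )
    for src, candidates in caption_candidates(page).items():
        print(src, candidates[:2])
```

Hand-set weights like these reflect the intuitive clue selection the text describes; a system that studies the relative value of clues would estimate such weights from labeled image-caption pairs instead.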
These clues reduce the amount of text to examine to find captions, and the reduced text can be indexed and used for keyword-based retrieval. But so far, the selection of these clues has been intuitive, and there has been no careful study of the relative values of clues.

This article reports on Marie-4 (see Figure 1), our latest in a series of caption-based image-retrieval systems [7]. Marie-4 uses a wide range of clues, broader than any system we know about, to locate image-caption pairs in HTML Web pages. It is in part an expert system where the knowledge used is not especially novel in itself, but the synergy of a variety of knowledge working together provides surprisingly good performance. Unlike some caption-based retrieval systems [3] and previous Marie systems, which require an image database with captions already extracted, Marie-4 is a Web crawler that autonomously searches the Web, locates captions using intelligent reasoning, and indexes them. It does not attempt full natural language processing and does not require the elaborate lexicon information of the earlier prototypes, so it is more flexible. Marie-4, a Web
Similar resources
Quantitative evaluation of recall and precision of CAT Crawler, a search engine specialized on retrieval of Critically Appraised Topics
Background: Critically Appraised Topics (CATs) are a useful tool that helps physicians make clinical decisions as healthcare moves towards the practice of Evidence-Based Medicine (EBM). The fast-growing World Wide Web has provided a place for physicians to share their appraised topics online, but an increasing amount of time is needed to find a particular topic within such a rich reposito...
Precise and Efficient Retrieval of Captioned Images: The MARIE Project
The MARIE project has explored knowledge-based information retrieval of captioned images of the kind found in picture libraries and on the Internet. It exploits the idea that images are easier to understand with context, especially descriptive text near them, but it also does image analysis. The MARIE approach has five parts: (1) find the images and captions; (2) parse and interpret the captio...
Caption Crawler: Enabling Reusable Alternative Text Descriptions using Reverse Image Search
Accessing images online is often difficult for users with vision impairments. This population relies on text descriptions of images that vary based on website authors’ accessibility practices. Where one author might provide a descriptive caption for an image, another might provide no caption for the same image, leading to inconsistent experiences. In this work, we present the Caption Crawler sy...
From Focused Crawling to Expert Information: an Application Framework for Web Exploration and Portal Generation
Focused crawling is a relatively new, promising approach to improving the recall of expert search on the Web. It typically starts from a user- or community-specific tree of topics along with a few training documents for each tree node, and then crawls the Web with focus on these topics of interest. This process can efficiently build a theme-specific, hierarchical directory whose nodes are populate...
Finding Photograph Captions Multimodally on the World Wide Web
Several software tools index text of the World Wide Web, but little attention has been paid to the many valuable photographs. We present a relatively simple way to index them by localizing their likely explicit and implicit captions with a kind of expert system. We use multimodal clues from the general appearance of the image, layout of the Web page, and the words nearby the image that are like...
Journal title:
- IEEE Intelligent Systems
Volume 17, Issue
Pages -
Publication date: 2002